[DEV-1434] Wire eval command to judge-facts cross-session memory #12

Merged
alexeyzimarev merged 2 commits into main from
alexeyzimarev/dev-1434b-cli-judge-facts
Apr 13, 2026
Conversation

@alexeyzimarev
Member

Summary

Third leg of DEV-1434. The CLI now:

  1. Fetches retained judge facts per category at eval startup (GET /api/judge-facts?category=<cat> × 4).
  2. Formats each category's facts as a bulleted block and injects them into the matching judge's prompt under a new "Known patterns" section.
  3. Parses an optional `retain_fact` from each judge's response (independent of the verdict parser so it doesn't regress when verdict parsing fails).
  4. POSTs non-null / non-empty facts to `/api/judge-facts` so future runs see them.

Prompt template grew a "Known patterns" section and a `retain_fact` field in the response schema, with strict guidance on when to emit one — only generalizable patterns ("User force-pushes with uncommitted work"), never single observations ("Ran rm -rf /tmp/cache"), to stop the retained-fact list from degenerating into noise.

Design notes

  • Fact fetching per-category is serial but cheap (4 small-stream reads). Any per-category fetch failure is logged and skipped so a single bad category doesn't kill the whole run.
  • Empty fact list renders `(no patterns retained for this category yet)` so the prompt still reads naturally on a fresh system.
  • `ExtractRetainFact` tolerates: code fences, missing field, explicit null, empty/whitespace string, non-string value, malformed JSON — in all cases returning null.
  • Depends on server-side endpoints in kurrent-io/Kurrent.Capacitor#475.
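
The empty-state and bulleted rendering described in the design notes could look roughly like the sketch below. It is illustrative only: the signature (a list of plain strings), the `KnownPatternsSketch` class name, and the `- ` bullet prefix are assumptions, not the PR's actual code. Only the empty-marker text is quoted from the notes above.

```csharp
using System.Collections.Generic;
using System.Linq;

static class KnownPatternsSketch {
    // Marker text quoted from the design notes.
    const string EmptyMarker = "(no patterns retained for this category yet)";

    // Hypothetical signature: the fact strings retained for one category.
    public static string FormatKnownPatterns(IReadOnlyList<string> facts) =>
        facts.Count == 0
            ? EmptyMarker                                      // fresh system: prompt still reads naturally
            : string.Join("\n", facts.Select(f => $"- {f}"));  // one bullet per retained fact
}
```

Returning a marker instead of an empty string keeps the "Known patterns" section from looking like a template bug when nothing has been retained yet.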

Also cleaned up

Four pre-existing `TUnitAssertions0015` warnings in `SetupCommandTests` (`.IsEqualTo(true)` → `.IsTrue()`).

Test plan

  • `dotnet build src/kapacitor/kapacitor.csproj` — clean, 0 warnings
  • `dotnet publish -c Release` — zero IL3050/IL2026 warnings (AOT-clean)
  • `dotnet run --project test/kapacitor.Tests.Unit --no-build` — 205/205 pass
    • 10 new EvalCommandTests: `FormatKnownPatterns` (empty + populated), `ExtractRetainFact` (present, fenced, absent, null, empty, whitespace, non-string, malformed JSON)
    • Existing `BuildQuestionPrompt` test updated for new `knownPatterns` parameter
  • CI
  • End-to-end once #475 lands

🤖 Generated with Claude Code

alexeyzimarev and others added 2 commits April 13, 2026 13:53
Third leg of DEV-1434: the CLI now fetches retained judge facts per category
at eval startup, injects them into each judge prompt as "known patterns",
parses an optional retain_fact from each judge response, and POSTs novel
facts back to the server for future runs.

- Models.cs: JudgeFactPayload (write), JudgeFact (read), registered in the
  source-gen JSON context as List<JudgeFact> + JudgeFactPayload.
- EvalCommand: FetchAllJudgeFactsAsync loads all four categories at startup
  and stores them per-category for per-question injection; a fetch failure
  for any single category is logged and skipped (non-fatal). PostJudgeFactAsync
  runs after each judge when retain_fact is non-null and non-empty.
- ExtractRetainFact is a standalone parser that tolerates code fences and
  rejects null/undefined/empty/whitespace/non-string values — independent
  of ParseVerdict so retained-fact plumbing doesn't regress if verdict
  parsing ever fails.
- FormatKnownPatterns renders the facts as a bulleted block, with an
  explicit "(none yet)" marker when empty so the prompt still reads
  naturally on a fresh system.
- Prompt template grew a "Known patterns" section and a retain_fact field
  in the response schema, with strict guidance on when to emit one (only
  generalizable patterns, never single observations).

10 new EvalCommandTests covering FormatKnownPatterns (empty + populated)
and ExtractRetainFact (present, fenced, absent, null, empty, whitespace,
non-string, malformed JSON). Full suite 205/205, AOT publish clean.

Depends on the server-side endpoints in #475 (kurrent-io/Kurrent.Capacitor).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Replace .IsEqualTo(true) on nullable bool JSON reads with .IsTrue() (per
TUnit analyzer suggestion). Coerce the nullable via `?? false` so the
assertion documents "this must be true" rather than "this must equal true,
or null is fine" — both match the original intent since the test setup
writes enabledPlugins as `true`.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@linear

linear bot commented Apr 13, 2026

@qodo-code-review

Review Summary by Qodo

Wire eval command to judge-facts cross-session memory

✨ Enhancement

Walkthroughs

Description
• Implement cross-session judge-facts memory for eval command
  - Fetch retained facts per category at eval startup
  - Inject facts into judge prompts as "known patterns" section
  - Parse optional retain_fact from judge responses
  - POST novel facts back to server for future runs
• Add FormatKnownPatterns and ExtractRetainFact helper methods
  - Format facts as bulleted list with empty-state marker
  - Tolerant parsing of retain_fact field (handles code fences, null, empty, malformed JSON)
• Extend prompt template with "Known patterns" section and retain_fact response field
• Add 10 new unit tests covering fact formatting and extraction edge cases
• Fix 4 pre-existing TUnitAssertions0015 warnings in SetupCommandTests
Diagram
flowchart LR
  A["Eval startup"] -->|FetchAllJudgeFactsAsync| B["Load facts per category"]
  B -->|FormatKnownPatterns| C["Inject into judge prompt"]
  D["Judge response"] -->|ExtractRetainFact| E["Parse retain_fact field"]
  E -->|PostJudgeFactAsync| F["POST to /api/judge-facts"]
  F -->|Future evals| A

File Changes

1. src/kapacitor/Commands/EvalCommand.cs ✨ Enhancement +121/-9

Implement judge-facts fetching, formatting, extraction, and posting

• Added FetchAllJudgeFactsAsync to load retained facts per category at eval startup with
 per-category error handling
• Added FormatKnownPatterns to render facts as bulleted list with "(no patterns retained yet)" for
 empty state
• Added ExtractRetainFact to parse optional retain_fact field from judge responses, tolerating
 code fences and malformed JSON
• Added PostJudgeFactAsync to persist novel facts back to server after each judge invocation
• Updated BuildQuestionPrompt signature to accept knownPatterns parameter and inject into
 template
• Added Categories constant array for the four judge categories


2. src/kapacitor/Models.cs ✨ Enhancement +38/-0

Add JudgeFact and JudgeFactPayload models

• Added JudgeFactPayload record for writing facts to server (category, fact, source session/run
 IDs)
• Added JudgeFact record for reading facts from server (includes retained_at timestamp)
• Registered both types in KapacitorJsonContext for source-gen JSON serialization


3. test/kapacitor.Tests.Unit/EvalCommandTests.cs 🧪 Tests +100/-2

Add tests for fact formatting and extraction

• Updated BuildQuestionPrompt_substitutes_all_placeholders test to include {KNOWN_PATTERNS}
 placeholder
• Added FormatKnownPatterns_returns_explicit_empty_marker_when_no_facts test
• Added FormatKnownPatterns_renders_bulleted_list test with multiple facts
• Added 8 ExtractRetainFact tests covering: present value, code-fenced response, absent field,
 explicit null, empty string, whitespace, non-string value, malformed JSON


4. test/kapacitor.Tests.Unit/SetupCommandTests.cs 🐞 Bug fix +8/-8

Clear TUnitAssertions0015 warnings in SetupCommandTests

• Replaced .IsEqualTo(true) with .IsTrue() on 4 nullable bool assertions in
 InstallPlugin_CreatesNewSettingsFile, InstallPlugin_PreservesExistingSettings, and
 InstallPlugin_MalformedJson_StartsFromScratch tests
• Added null-coalescing operator (?? false) to handle nullable bool JSON reads


5. src/kapacitor/Resources/prompt-eval-question.txt 📝 Documentation +21/-1

Extend prompt template with known patterns and retain_fact field

• Added "Known patterns" section explaining prior context from past evaluations
• Added retain_fact field to response JSON schema with null as default
• Added "When to emit retain_fact" guidance section with examples of generalizable patterns vs.
 single observations
• Clarified that retained facts should not be noise and judges should emit null if nothing is worth
 generalizing


@qodo-code-review

qodo-code-review bot commented Apr 13, 2026

Code Review by Qodo

🐞 Bugs (4)   📘 Rule violations (0)   📎 Requirement gaps (0)   🖥 UI issues (0)   🎨 UX Issues (0)
≡ Correctness (1) ☼ Reliability (2) ⚙ Maintainability (1)


Action required

1. retain_fact skipped on failure 🐞
Description
HandleEval only persists retain_fact after ParseVerdict succeeds, so any judge response that
includes a usable retain_fact but fails verdict parsing is silently not retained. This contradicts
ExtractRetainFact’s intent that retention plumbing “doesn't depend on verdict parsing succeeding.”
Code

src/kapacitor/Commands/EvalCommand.cs[R140-143]

+            // If the judge emitted a retain_fact, persist it for future evals.
+            if (ExtractRetainFact(result.Result) is { } retainedFact) {
+                await PostJudgeFactAsync(httpClient, baseUrl, q.Category, retainedFact, context.SessionId, evalRunId);
+            }
Evidence
The eval loop continues early when ParseVerdict returns null, so
ExtractRetainFact/PostJudgeFactAsync is never executed for that response. This is at odds with the
ExtractRetainFact doc comment stating retention shouldn't depend on verdict parsing success.

src/kapacitor/Commands/EvalCommand.cs[129-144]
src/kapacitor/Commands/EvalCommand.cs[216-245]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`retain_fact` persistence is currently gated on `ParseVerdict` succeeding. If the judge returns a response where `retain_fact` is present/valid but the verdict JSON is unparseable or fails validation (e.g., missing/out-of-range score), the loop `continue`s before calling `ExtractRetainFact`, so the fact is never retained.

### Issue Context
`ExtractRetainFact` is documented as independent of `ParseVerdict` so retention shouldn't depend on verdict parsing succeeding.

### Fix
Refactor the per-question loop to extract `retain_fact` regardless of verdict parse outcome, and only gate `verdicts.Add(...)` on `ParseVerdict`.

A typical structure:
- `var retainedFact = ExtractRetainFact(result.Result);`
- `var verdict = ParseVerdict(...);`
- if verdict != null -> add
- if retainedFact != null -> post
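
A minimal sketch of that structure, with the two parsers passed in as delegates so that only the control flow is shown. `EvalLoopSketch`, `Run`, and the in-memory `facts` list are hypothetical stand-ins. In the real loop the fact would go to `PostJudgeFactAsync` and the verdict to `verdicts.Add(...)`.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

static class EvalLoopSketch {
    // Hypothetical helper: runs both parsers on every response, independently of each other.
    public static (List<TVerdict> Verdicts, List<string> Facts) Run<TVerdict>(
        IEnumerable<string> responses,
        Func<string, TVerdict?> parseVerdict,
        Func<string, string?> extractRetainFact) where TVerdict : class {
        var verdicts = new List<TVerdict>();
        var facts = new List<string>();

        foreach (var response in responses) {
            // Extract the fact first so a failed verdict parse cannot skip it.
            var retainedFact = extractRetainFact(response);
            var verdict = parseVerdict(response);

            if (verdict is not null) verdicts.Add(verdict);          // only the verdict is gated on ParseVerdict
            if (retainedFact is not null) facts.Add(retainedFact);   // stand-in for PostJudgeFactAsync
        }

        return (verdicts, facts);
    }
}
```

The point of the reordering is that neither branch uses `continue`, so one parser failing never short-circuits the other.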

### Fix Focus Areas
- src/kapacitor/Commands/EvalCommand.cs[105-144]



2. Judge-facts JSON aborts eval 🐞
Description
FetchAllJudgeFactsAsync only catches HttpRequestException; malformed/non-matching JSON from
/api/judge-facts can throw JsonException from JsonSerializer.Deserialize and crash eval startup.
This violates the comment that fact-fetch failures don’t abort the run.
Code

src/kapacitor/Commands/EvalCommand.cs[R251-265]

+            try {
+                using var resp = await httpClient.GetWithRetryAsync($"{baseUrl}/api/judge-facts?category={category}");
+                if (!resp.IsSuccessStatusCode) {
+                    Log($"Failed to fetch judge facts for {category}: HTTP {(int)resp.StatusCode}");
+
+                    continue;
+                }
+
+                var json = await resp.Content.ReadAsStringAsync();
+                var list = JsonSerializer.Deserialize(json, KapacitorJsonContext.Default.ListJudgeFact) ?? [];
+                result[category] = list;
+                Log($"Loaded {list.Count} retained facts for category {category}");
+            } catch (HttpRequestException ex) {
+                Log($"Could not load judge facts for {category}: {ex.Message}");
+            }
Evidence
FetchAllJudgeFactsAsync deserializes response JSON without catching JsonException, and HandleEval
calls it without a surrounding try/catch—so a JSON parse failure will propagate and abort the entire
eval. Elsewhere in the same file, JSON parsing failures are expected and handled via `catch
(JsonException)` (ParseVerdict).

src/kapacitor/Commands/EvalCommand.cs[93-99]
src/kapacitor/Commands/EvalCommand.cs[247-269]
src/kapacitor/Commands/EvalCommand.cs[312-320]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`FetchAllJudgeFactsAsync` can throw `JsonException` when `/api/judge-facts` returns malformed JSON (or an unexpected shape). The method currently only catches `HttpRequestException`, so this exception will bubble up and abort `HandleEval` before any questions run.

### Issue Context
The comment in `HandleEval` says failures fetching retained facts should not abort the run.

### Fix
In `FetchAllJudgeFactsAsync`, extend error handling to catch `JsonException` (and optionally `NotSupportedException`) around `JsonSerializer.Deserialize`, log an informative message, and `continue` to the next category.
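
One way the broadened handling might look, sketched without the HTTP layer: the input is a map of category to raw response body, and facts are simplified to plain strings rather than the `JudgeFact` record. `FactLoadSketch`, `ParseAll`, and the `log` callback are assumptions made for illustration. The real method deserializes through `KapacitorJsonContext.Default.ListJudgeFact`.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

static class FactLoadSketch {
    // Hypothetical helper: parses each category's body; a bad body is logged and skipped.
    public static Dictionary<string, List<string>> ParseAll(
        IReadOnlyDictionary<string, string> bodies, Action<string> log) {
        var result = new Dictionary<string, List<string>>();

        foreach (var (category, json) in bodies) {
            try {
                // A JSON literal null deserializes to null, hence the fallback to an empty list.
                result[category] = JsonSerializer.Deserialize<List<string>>(json) ?? new List<string>();
            } catch (Exception ex) when (ex is JsonException or NotSupportedException) {
                // Malformed or wrongly shaped JSON: non-fatal, move on to the next category.
                log($"Could not parse judge facts for {category}: {ex.Message}");
            }
        }

        return result;
    }
}
```

A category that fails to parse is simply absent from the result, so callers would need to treat a missing key the same as an empty fact list.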

### Fix Focus Areas
- src/kapacitor/Commands/EvalCommand.cs[93-99]
- src/kapacitor/Commands/EvalCommand.cs[247-269]




Remediation recommended

3. ExtractRetainFact non-object crash 🐞
Description
ExtractRetainFact can throw InvalidOperationException when the response is valid JSON but not an
object (e.g., a JSON string/array), because it calls RootElement.TryGetProperty unconditionally and
only catches JsonException. This breaks the function’s tolerance guarantees and becomes more likely
if retain_fact extraction is moved earlier.
Code

src/kapacitor/Commands/EvalCommand.cs[R226-244]

+        try {
+            using var doc = JsonDocument.Parse(json);
+            if (!doc.RootElement.TryGetProperty("retain_fact", out var prop)) {
+                return null;
+            }
+
+            if (prop.ValueKind is JsonValueKind.Null or JsonValueKind.Undefined) {
+                return null;
+            }
+
+            if (prop.ValueKind != JsonValueKind.String) {
+                return null;
+            }
+
+            var text = prop.GetString()?.Trim();
+            return string.IsNullOrEmpty(text) ? null : text;
+        } catch (JsonException) {
+            return null;
+        }
Evidence
ExtractRetainFact unconditionally calls doc.RootElement.TryGetProperty(...) and does not check
ValueKind nor catch exceptions other than JsonException. TryGetProperty is only valid on object
elements, so valid non-object JSON can throw and crash the eval if this function is invoked on such
output.

src/kapacitor/Commands/EvalCommand.cs[223-245]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`ExtractRetainFact` assumes the parsed JSON root is an object and calls `TryGetProperty`. If the model returns valid JSON that isn't an object (e.g., `"oops"`, `[1,2]`, `true`), `TryGetProperty` can throw `InvalidOperationException`, which is not caught.

### Issue Context
The function is documented as tolerant; additionally, fixing the gating issue may cause `ExtractRetainFact` to run on more malformed/unexpected outputs.

### Fix
Add a guard:
- After parsing, if `doc.RootElement.ValueKind != JsonValueKind.Object`, return null.
Optionally broaden the catch to include `InvalidOperationException` as a belt-and-suspenders fallback.
Add a unit test for a valid-but-non-object JSON response (e.g., `"hi"` or `[1]`) returning null.
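
A sketch of the guarded parser, assuming the input has already had any code fences stripped (that step is omitted here). The class name `RetainFactSketch` is hypothetical. The body otherwise mirrors the snippet quoted in this finding, with the `ValueKind` check added up front.

```csharp
#nullable enable
using System.Text.Json;

static class RetainFactSketch {
    public static string? ExtractRetainFact(string json) {
        try {
            using var doc = JsonDocument.Parse(json);

            // Guard: TryGetProperty is only valid on objects, so reject "hi", [1], true, etc.
            if (doc.RootElement.ValueKind != JsonValueKind.Object) return null;

            if (!doc.RootElement.TryGetProperty("retain_fact", out var prop)) return null;

            // Covers explicit null and any non-string value in one check.
            if (prop.ValueKind != JsonValueKind.String) return null;

            var text = prop.GetString()?.Trim();
            return string.IsNullOrEmpty(text) ? null : text;
        } catch (JsonException) {
            return null; // malformed JSON
        }
    }
}
```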

### Fix Focus Areas
- src/kapacitor/Commands/EvalCommand.cs[223-245]
- test/kapacitor.Tests.Unit/EvalCommandTests.cs[240-311]




Advisory comments

4. Category list duplication risk 🐞
Description
EvalCommand duplicates categories in both Questions[] and Categories[], so adding a new question
category later can silently skip judge-fact fetching/retention for that category. This creates a
latent drift bug.
Code

src/kapacitor/Commands/EvalCommand.cs[294]

+    static readonly string[] Categories = ["safety", "plan_adherence", "quality", "efficiency"];
Evidence
Questions define the categories that are actually evaluated, but FetchAllJudgeFactsAsync iterates
over a separate Categories array; if they diverge, known-pattern injection and retention will be
incomplete for the new category.

src/kapacitor/Commands/EvalCommand.cs[13-34]
src/kapacitor/Commands/EvalCommand.cs[247-266]
src/kapacitor/Commands/EvalCommand.cs[294-294]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
There are two sources of truth for categories (`Questions` and `Categories`). This can drift over time and silently break judge-facts fetching/retention for new categories.

### Fix
Derive `Categories` from `Questions` (e.g., distinct categories), or remove `Categories` entirely and iterate distinct categories from `Questions` within `FetchAllJudgeFactsAsync`.
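
The first option could be sketched as follows. The `Question` record shape and the placeholder question texts are assumptions. Only the four category names come from the `Categories` line quoted above.

```csharp
using System.Linq;

static class CategorySketch {
    // Hypothetical shape of a question; the real Questions[] entries may carry more fields.
    public sealed record Question(string Category, string Text);

    // Must be declared before Categories: static fields initialize in textual order.
    public static readonly Question[] Questions = {
        new("safety", "placeholder question A"),
        new("safety", "placeholder question B"),
        new("plan_adherence", "placeholder question C"),
        new("quality", "placeholder question D"),
        new("efficiency", "placeholder question E"),
    };

    // Single source of truth: a new question category is picked up automatically.
    public static readonly string[] Categories =
        Questions.Select(q => q.Category).Distinct().ToArray();
}
```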

### Fix Focus Areas
- src/kapacitor/Commands/EvalCommand.cs[13-34]
- src/kapacitor/Commands/EvalCommand.cs[247-266]
- src/kapacitor/Commands/EvalCommand.cs[294-294]

@alexeyzimarev merged commit d700725 into main on Apr 13, 2026
3 checks passed
@alexeyzimarev deleted the alexeyzimarev/dev-1434b-cli-judge-facts branch on April 13, 2026 12:50